Runtime/namespace/client wide worker heartbeat #983

yuandrew merged 21 commits into temporalio:worker-heartbeat
Conversation
cretz
left a comment
The PR is titled "process-wide" but it seems it's actually "runtime-wide". IMO, though, it should be at the client level: the client is where you have the connection to actually make the invocation, and it is the thing workers share to communicate with the server.
Sushisource
left a comment
As for the client/runtime thing, I think this is pretty preferential and an internal detail that won't matter much to users.
The thing I think that does matter is clients don't "own" workers (and in fact can't, due to circular deps). They are used when creating a worker, but the relationship is more like a worker "has" a client, and more than one worker may have the same client.
That's why I don't really want the heartbeating duration to be a client property. Right now our client options in the lang layer don't directly reference anything worker specific (they do indirectly with interceptors/plugins), and I think it should really stay that way. Specifying a heartbeating duration for a client that is never used with a worker, for example, is weird.
So, that means either this option stays on a Runtime (seems perfectly fine to me) or needs to be passed in when initting the worker, but then there's some last-write-wins problem if users use the same client with different values when passing to workers. So, more reason to put it on the Runtime.
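To make the last-write-wins concern concrete, here is a minimal std-only sketch with hypothetical `Client`/`Worker` stand-ins (none of these are the real SDK types): if worker init could set a heartbeat interval on a shared client, the last worker constructed would silently clobber earlier workers' settings.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Hypothetical stand-ins, only to illustrate the last-write-wins hazard.
struct Client {
    heartbeat_interval: Mutex<Option<Duration>>,
}

struct Worker {
    client: Arc<Client>,
}

impl Worker {
    // If worker init were allowed to write a heartbeat interval onto the
    // shared client, each construction overwrites the previous value.
    fn new(client: Arc<Client>, heartbeat: Option<Duration>) -> Self {
        *client.heartbeat_interval.lock().unwrap() = heartbeat;
        Worker { client }
    }
}

fn main() {
    let client = Arc::new(Client { heartbeat_interval: Mutex::new(None) });
    let _w1 = Worker::new(client.clone(), Some(Duration::from_secs(60)));
    // w2 disables heartbeating, clobbering w1's setting on the shared client.
    let _w2 = Worker::new(client.clone(), None);
    assert_eq!(*client.heartbeat_interval.lock().unwrap(), None);
}
```

This is the hazard of passing the value at worker-init time against a shared client; keeping the option on the Runtime sidesteps it.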
As for literally where this map lives or not - I don't have a huge preference either way. Changing the trait that was used to get slot suppliers for eager to be more generic & used for this I can see. I'm also fine with the current setup.
Right, they are associated with them. That's the point. Same for eager workflow start. A worker is associated with a single client, not some other client in the process/runtime.

@Sushisource - to confirm, you don't believe we need to make sure a worker's heartbeat occurs on the worker's client? I think that is much simpler to understand for those who want to understand which calls use which clients. This is important because users control the lifetimes of clients, handle explicit client replacement to update auth, and are the ones that control which clients are used for which workers. I still can't really see a reason not to use a worker's client to make calls relating to that worker (or the set of workers that share the client).
Sure, I think that's fair (though the actual scenario in the current design where that wouldn't happen seems exceptionally rare). I maintain that the config option shouldn't live in the client options, though.
Do you believe that I, as a user, need to make a whole new Tokio thread pool if I want to disable worker heartbeating only for some workers? Not a rhetorical question; I don't mind if the answer is "yes, that should be rare".
Technically you wouldn't even have to, right now (at least as this is exposed in Core - you might have to depending on how lang does it). You can make a new CoreRuntime without constructing a new Tokio runtime, re-using an existing one instead. So we can support that. But even if we don't, I'm not hugely concerned.
```rust
/// Optional worker heartbeat interval - This configures the heartbeat setting of all
/// workers created using this runtime.
#[builder(default = Some(Duration::from_secs(60)))]
heartbeat_interval: Option<Duration>,
```
I still think it's unnecessary to put this on the runtime instead of the client, but it's not a big deal. It disallows hybrid users (e.g. users that connect to both self-hosted and Cloud) from disabling/adjusting heartbeats specifically for one environment; from a lang POV they'll have to create an entirely new thread pool to be able to do so. It seems like the only benefit to putting this value on the runtime instead of the client is "feels", IIUC.

Having said that, not a big deal, up to y'all.
Well, again, it doesn't mean a new threadpool. We established in my other comment you can re-use the same tokio runtime for different runtimes.
The problem for me is I don't want some weird worker setting showing up in the client crate for people who are only using the client.
One way to address that I suppose is to put that option behind a feature flag that is off by default, but enabled when core brings in the client dep. I'd be OK with that. But, IMO it's mostly a non-issue either way so up to @yuandrew if you want to change it.
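The feature-flag approach could look roughly like this (crate and feature names here are illustrative guesses, not the actual manifests):

```toml
# client crate's Cargo.toml (illustrative): the worker-only option is
# gated behind a feature that is off by default.
[features]
default = []
worker-settings = []
```

```toml
# core crate's Cargo.toml (illustrative): core always enables the gate
# when it brings in the client dependency.
[dependencies]
temporal-client = { version = "0.1", features = ["worker-settings"] }
```

Plain client users would then never see the worker setting unless they opt in.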
cretz
left a comment
All looks great to me. Only concern is enabling it for everyone by default.
```rust
telemetry_options: TelemetryOptions,
/// Optional worker heartbeat interval - This configures the heartbeat setting of all
/// workers created using this runtime.
#[builder(default = Some(Duration::from_secs(60)))]
```
IMO we shouldn't turn on worker heartbeating by default at this time. We should get it into langs, make sure it works as we expect, has no adverse effects, etc., and then consider turning it on by default. Arguably I should be able to turn it off/on per client as mentioned at #983 (comment), but that's not a big deal.
This is going into a feature branch, so I'm turning it on for simplicity, but I agree we should not turn it on by default in main until everything is known to work as expected.
Yep, but, the goal is to have it on by default when released.
Immediately upon release? Meaning there's no period where we try it out as opt-in in lang first? When we did this before, users saw unintentional logs and such. I figure we should at least try it in langs before turning it on for everyone in langs.
Yes. Given the fact that this isn't really user-facing in any visible way ATM, I think the issue is basically zero people are going to flip this on.
Certainly though, @yuandrew , as part of our testing, let's make sure we're not having anything like that happen, including SDK and Server logs.
Sushisource
left a comment
Nice! I really like the way this is looking despite a bajillion comments. Mostly just polish stuff really. A few logic questions.
```diff
 r.expect_sdk_name_and_version()
     .returning(|| ("test-core".to_string(), "0.0.0".to_string()));
-r.expect_get_identity()
+r.expect_identity()
```
Bug: Mock Setup Errors in WorkerClient
The worker_set_key() method, recently added to the WorkerClient trait, has issues in its mock setups. In mock_worker_client(), the expect_worker_set_key() is incorrectly configured, passing a function pointer where a closure is expected. Additionally, mock_manual_worker_client() is missing this expectation entirely, which will cause a runtime panic if the method is called.
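To illustrate the missing-expectation failure mode with a hand-rolled stand-in (this is not mockall's actual machinery, just a sketch of the pattern): a mocked method panics at runtime when no expectation was configured, and the expectation-setter accepts any closure, including capturing ones that a bare fn pointer could not express.

```rust
// Minimal hand-rolled mock, illustrating why a mock missing an
// expectation for worker_set_key() panics when the method is called.
struct MockWorkerClient {
    worker_set_key: Option<Box<dyn Fn() -> String>>,
}

impl MockWorkerClient {
    fn new() -> Self {
        Self { worker_set_key: None }
    }

    // Accepts any closure, including ones that capture state.
    fn expect_worker_set_key(&mut self, f: impl Fn() -> String + 'static) {
        self.worker_set_key = Some(Box::new(f));
    }

    fn worker_set_key(&self) -> String {
        match &self.worker_set_key {
            Some(f) => f(),
            None => panic!("no expectation set for worker_set_key"),
        }
    }
}

fn main() {
    let mut m = MockWorkerClient::new();
    let key = String::from("ns|q");
    // A capturing closure works here; a plain fn pointer could not capture `key`.
    m.expect_worker_set_key(move || key.clone());
    assert_eq!(m.worker_set_key(), "ns|q");
}
```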
```rust
self.all_workers
    .insert(worker.worker_instance_key(), worker);

Ok(())
```
Bug: Worker Registration Inconsistency Causes Heartbeat Issues
The ClientWorkerSetImpl::register method can leave the registry in an inconsistent state: slot_providers is updated before the worker is fully registered in all_workers. If a later step, like heartbeat setup, fails, try_reserve_wft_slot may attempt to use a non-existent worker, and the stale entry prevents re-registration for that namespace/task queue, permanently losing the worker's heartbeat callback.
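One way to avoid the inconsistency, sketched on a simplified hypothetical registry (not the real ClientWorkerSetImpl): run every fallible step before publishing the worker into either map, so a failure leaves nothing to roll back.

```rust
use std::collections::HashMap;

// Simplified, hypothetical model of the registry; the real type also
// holds slot providers and heartbeat state.
#[derive(Default)]
struct Registry {
    all_workers: HashMap<String, String>,    // instance key -> worker
    slot_providers: HashMap<String, String>, // namespace+queue -> instance key
}

impl Registry {
    // Do all fallible work (e.g. heartbeat setup) before touching either
    // map, so a failure leaves no partial registration behind.
    fn register(
        &mut self,
        instance_key: &str,
        ns_queue: &str,
        heartbeat_setup: impl Fn() -> Result<(), String>,
    ) -> Result<(), String> {
        if self.slot_providers.contains_key(ns_queue) {
            return Err(format!("worker already registered for {ns_queue}"));
        }
        heartbeat_setup()?; // fails before any state is published
        self.all_workers
            .insert(instance_key.to_string(), "worker".to_string());
        self.slot_providers
            .insert(ns_queue.to_string(), instance_key.to_string());
        Ok(())
    }
}

fn main() {
    let mut r = Registry::default();
    // A failing heartbeat setup leaves both maps empty...
    assert!(r.register("k1", "ns/q", || Err("boom".to_string())).is_err());
    assert!(r.all_workers.is_empty() && r.slot_providers.is_empty());
    // ...so re-registration for the same namespace/task queue still succeeds.
    assert!(r.register("k1", "ns/q", || Ok(())).is_ok());
}
```

The key design choice is ordering: validation and fallible setup first, map insertions last, so no rollback path is needed.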
Sushisource
left a comment
Looks like we're good to go here, I think. Just a few small things; fix these lints and we're ready! Nice
```diff
 #[cfg(test)]
 fn num_providers(&self) -> (usize, usize) {
-    (self.index.len(), self.providers.len())
+    (self.slot_providers.len(), self.slot_providers.len())
```
No reason to return the same number twice here, the few places tests use this can just get updated.
```rust
/// For slot managing, there can only be one worker registered per
/// namespace+queue_name+client, others will get ignored.
```
Should be changed to say this is an error now
```rust
trace_subscriber: Option<Arc<dyn Subscriber + Send + Sync>>,
}

struct WorkerHeartbeatManager {
```
nit: Move this down to live next to the registrator
…nt level (#1038)

* Runtime/namespace/client wide worker heartbeat (#983)
* worker heartbeat
* Address Spencer's comments
* wip use client_identity_override as part of key, added test
* Refactor almost complete, need to plumb through telemetry to SharedNamespaceWorker
* Verified client replacement works, need to update tests and cleanup
* formating
* clean up
* forgot to remove new() now that using builder pattern
* Switch to worker_set_key
* Replace client test passes, need to write unit tests in worker_registry
* cargo test-lint
* limit nexus to 1 poller, add tests for worker_registry for heartbeat
* PR comments
* new test helper
* Return error on multi worker register for same namespace and task queue on same client
* cargo fmt
* Fix registration order, unique task queue for test worker
* Remove TEST_Q variable
* Missing quotes
* CI lint and docker test fix, rename worker_set_key to worker_grouping_key
* clippy bug
* Worker heartbeat: New in-memory metrics mechism, plumb rest of heartbeat data (#1023)
* plumb in memory metrics
* simplify worker::new(), fix some heartbeat metrics, new test file
* CounterImpl, final_heartbeat, more specific metric label dbg_panic msg, counter_with_in_mem and and_then()
* Support in-mem metrics when metrics aren't configured
* Move sys_info refresh to dedicated thread, use tuner's existing sys info
* Format, AtomicCell
* Fix unit test
* Set dynamic config for WorkerHeartbeatsEnabled and ListWorkersEnabled, remove stale metric previously added
* Should not expect heartbeat nexus worker in metrics for non-heartbeating integ test
* recv_timeout instead of thread::sleep, use WorkflowService::list_workers directly, WithLabel API improvement
* MetricAttributes::NoOp, add mechanism to ignore dupe workers for testing, more tests
* More tests, sticky cache miss, plugins
* Formatting, fix skip_client_worker_set_check
* Cursor found a bug
* Lower sleep time, add print for debugging
* more prints
* use semaphores for worker_heartbeat_failure_metrics
* skip_client_worker_set_check for all integ workers
* Can't use tokio semaphore in workflow code
* use signal to test workflow_slots.last_interval_failure_tasks
* Use Notify instead of semaphores, fix test flake
* Use eventually() instead of a manual sleep
* max_outstanding_workflow_tasks 2
* merge
* Forgot to commit format fixes
* Fix test
NOTE: this targets a `worker-heartbeat` feature branch to merge into; that way I can make incremental progress and not hit main until the whole feature is ready for both server and SDK.

What was changed

Worker heartbeat duration is a setting configured with `RuntimeOptions`, but internally the heartbeat mechanism lives on the client. This way we can properly replace the client the `SharedNamespaceWorker` uses when replacing the client of a regular worker.

A follow-up PR will address filling out the rest of the heartbeat data; most of the remaining pieces require an implementation of storing metrics in memory so we can pull that data on each heartbeat.
Why?
Worker heartbeat. I separated WorkerHeartbeat out of #962. Some of the design for this is with Worker Commands in mind, but that will come at a later time.
Checklist
Closes
How was this tested:
- fixed up `worker_heartbeat` unit test
- verified with prints (printed SDK name and version) that replace_client works with heartbeating, but couldn't come up with a good unit/integration test to verify this behavior